Quantifying Tackling in Football

A Data-Driven Approach Using the NFL Big Data Bowl Dataset and Advanced Machine Learning Techniques

Dusty Turner

A Quick Reminder

Research Hypothesis


Research Question: Can we estimate, for every play, the probability that each defensive player makes the tackle?

Ultimately: Assign a ‘tackles over expected’ value to each player.

Literature Review

Previous NFL Big Data Bowl Competitions

  • 2020: How many yards will an NFL player gain after receiving a handoff?
  • 2021: Evaluate defensive performance on passing plays
  • 2022: Evaluate special teams performance
  • 2023: Evaluate linemen on pass plays

Data

Player & Game Identifiers

  • Game and Play IDs: Unique identifiers for games and individual plays
  • Player Information: Names, jersey numbers, team, position, physical attributes, college

In-Game Player Movements

  • Spatial Data: Player positions, movement direction, speed, and orientation
  • Time and Motion: Specific moments in play, distance covered

Detailed Play Information

  • Play Attributes: Description, quarter, down, yards needed
  • Team & Field Position: Possessing team, defensive team, yardline positions

Scoring and Game Probabilities

  • Scores & Results: Pre-snap scores, play outcomes
  • Probabilities: Win probabilities for home and visitor teams
  • Expected Points: Points added or expected by play outcomes

Tackles, Penalties, and Formations

  • Tackles & Fouls: Tackles, assists, fouls committed, and missed tackles
  • Ball Carrier Info: Identifiers and names of ball carriers
  • Team Formations: Offensive formations and number of defenders

Feature Development

Positions

Alignment Clusters

Modeling

Rows: 393,536
Technique: Group Splitting
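
Group splitting keeps every row that shares a grouping key in the same partition, so rows from one play never appear in both the training and test sets. A minimal sketch in plain Python follows. It assumes the group is a (game ID, play ID) pair; the slides do not say which key was used, and `group_split`, the column names, and the 80/20 ratio are illustrative only.

```python
import random
from collections import defaultdict

def group_split(rows, key, test_fraction=0.2, seed=0):
    """Split rows so that every group lands entirely in train or in test."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    group_ids = sorted(groups)               # fixed order before shuffling
    random.Random(seed).shuffle(group_ids)   # reproducible shuffle

    n_test = max(1, round(len(group_ids) * test_fraction))
    test = [r for g in group_ids[:n_test] for r in groups[g]]
    train = [r for g in group_ids[n_test:] for r in groups[g]]
    return train, test

# Toy data: 10 plays with 3 defender rows each (hypothetical column names).
rows = [
    {"game_id": 1, "play_id": p, "defender": d}
    for p in range(10)
    for d in range(3)
]
train, test = group_split(rows, key=lambda r: (r["game_id"], r["play_id"]))
```

Splitting on groups rather than on rows avoids leakage: consecutive tracking frames from one play are highly correlated, so a row-level split would let the model see near-copies of its test rows during training.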

Factors to Consider:
- Tackle (0/1)
- Future X/Y
- S/A/O/Dir (speed, acceleration, orientation, direction) of defender
- Position / Alignment Cluster
- Number of Defenders in the Box
- Current and future (0.5 seconds ahead) location of the ball
- S/A/O/Dir of ball carrier
- Velocity/direction difference
- Ball in defensive player’s ‘fan’
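
Several of these factors are geometric, and the sketch below shows one way they could be computed. It is not the author's feature code. It assumes angles are in degrees measured clockwise from the +y axis, that future location is a constant-velocity projection, and that the 'fan' is a cone centred on the defender's orientation. The function names and the 45-degree half-angle are hypothetical.

```python
import math

def project_position(x, y, speed, direction_deg, dt=0.5):
    """Constant-velocity projection of a position dt seconds ahead.

    Assumes direction is in degrees, measured clockwise from the +y axis.
    """
    theta = math.radians(direction_deg)
    return x + speed * dt * math.sin(theta), y + speed * dt * math.cos(theta)

def angle_difference(a_deg, b_deg):
    """Smallest absolute difference between two headings, in [0, 180]."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)

def bearing(from_xy, to_xy):
    """Heading from one point to another, clockwise from the +y axis."""
    dx = to_xy[0] - from_xy[0]
    dy = to_xy[1] - from_xy[1]
    return math.degrees(math.atan2(dx, dy)) % 360.0

def in_fan(defender_xy, orientation_deg, ball_xy, half_angle_deg=45.0):
    """True if the ball lies inside a cone centred on the defender's orientation.

    The 45-degree half-angle is a placeholder, not a value from the slides.
    """
    to_ball = bearing(defender_xy, ball_xy)
    return angle_difference(orientation_deg, to_ball) <= half_angle_deg

# Ball carrier at (10, 20) moving at 4 yards/second along +x.
future_x, future_y = project_position(10.0, 20.0, speed=4.0, direction_deg=90.0)

# Heading gap between a defender moving at 350 degrees and a carrier at 10 degrees.
heading_gap = angle_difference(350.0, 10.0)

# Defender at the origin facing +y, ball slightly to the right and ahead.
ball_in_fan = in_fan((0.0, 0.0), 0.0, (1.0, 5.0))
```

Taking the difference modulo 360 matters because headings wrap around: 350 and 10 degrees are 20 degrees apart, not 340.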

Models:
- Penalized Regression (Train: 8019; Test: 1738)
- Random Forest (Train: 8019; Test: 1738)
- XGBoost (Train: 8019; Test: 1738)
- Neural Network (Train: 231,110; Test: 90,090)

Baseline Accuracy:
- Non-Neural Net: 92.85%
- Neural Net: 92.94%
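
The slides do not define the baseline. One common choice, assumed here, is the majority-class rate: the accuracy obtained by predicting 'no tackle' for every row. The sketch below uses made-up labels, and `majority_baseline_accuracy` is an illustrative name.

```python
def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common class in a 0/1 label list."""
    positive_rate = sum(labels) / len(labels)
    return max(positive_rate, 1.0 - positive_rate)

# Toy labels: one tackle among 11 defender rows (made-up data).
labels = [1] + [0] * 10
baseline = majority_baseline_accuracy(labels)
```

With classes this imbalanced, a model has to beat the baseline rather than 50% for its accuracy to mean anything.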

Penalized Regression

\[\text{Minimize } \left\{ \frac{1}{N} \sum_{i=1}^{N} (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 + \lambda \left[ \frac{1 - \alpha}{2} \|\boldsymbol{\beta}\|_2^2 + \alpha \|\boldsymbol{\beta}\|_1 \right] \right\}\]
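
The objective above can be evaluated directly for a given coefficient vector. The plain-Python sketch below mirrors the formula term by term. It is for illustration only and is not how the model was fitted; `elastic_net_objective` and the toy data are assumptions.

```python
def elastic_net_objective(X, y, beta, lam, alpha):
    """Mean squared error plus the elastic net penalty.

    alpha = 0 gives a pure ridge (L2) penalty; alpha = 1 gives a pure lasso (L1) penalty.
    """
    n = len(y)
    mse = sum(
        (y_i - sum(x_ij * b_j for x_ij, b_j in zip(x_i, beta))) ** 2
        for x_i, y_i in zip(X, y)
    ) / n
    l2_squared = sum(b * b for b in beta)   # squared L2 norm of beta
    l1 = sum(abs(b) for b in beta)          # L1 norm of beta
    return mse + lam * ((1.0 - alpha) / 2.0 * l2_squared + alpha * l1)

# Toy problem: two observations, two predictors (made-up numbers).
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 0.0]
beta = [0.5, -0.5]

ridge_like = elastic_net_objective(X, y, beta, lam=0.1, alpha=0.0)
lasso_like = elastic_net_objective(X, y, beta, lam=0.1, alpha=1.0)
```

Here \(\lambda\) sets the overall penalty strength and \(\alpha\) mixes the two norms.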

The best parameters are Lambda = 0.00853 and Alpha = \(2.21 \times 10^{-5}\), with an accuracy of 91.14%.

Random Forest

The best parameters are: Mtry = 6, Min_n = 6, and Trees = 1848 with an accuracy of 92.86%.

XGBoost

The best parameters are: Trees = 652, Min_n = 8, Tree Depth = 2, Learn Rate = 1.1, and Loss Reduction = \(2.9078823 \times 10^{9}\), with an accuracy of 92.86%.

Neural Network

Tackles Above or Below Expected



\(\sum_{i=1}^{N} (\mathbb{I}_{\text{tackle}_i} - P(\text{tackle}_i))\)

Where:

  1. \(N\) is the total number of plays
  2. \(P(\text{tackle}_i)\) is the probability of a tackle on play \(i\)
  3. \(\mathbb{I}_{\text{tackle}_i}\) is the indicator function which is 1 if a tackle occurred on play \(i\) and 0 otherwise
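
Applied per player, the sum above reduces to a grouped accumulation of actual minus predicted. The sketch below uses made-up probabilities in place of model output; `tackles_over_expected` and the row layout are illustrative assumptions.

```python
from collections import defaultdict

def tackles_over_expected(rows):
    """Sum (tackle indicator - predicted tackle probability) for each player."""
    totals = defaultdict(float)
    for player, made_tackle, p_tackle in rows:
        totals[player] += made_tackle - p_tackle
    return dict(totals)

# Each row: (player, tackle indicator, model probability). Values are made up.
rows = [
    ("A", 1, 0.25),
    ("A", 0, 0.25),
    ("B", 0, 0.5),
    ("B", 0, 0.25),
]
toe = tackles_over_expected(rows)
```

A positive total means the player made more tackles than the model expected from their situations; a negative total means fewer.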